class: center, middle, inverse, title-slide

.title[
# Workflows for Reproducible Research with R & Git
]
.subtitle[
## Dependency Management
]
.author[
### Johannes Breuer, Bernd Weiss, & Arnim Bleier
]
.date[
### 2023-11-17
]

---

layout: true

---

## Dependencies in `R`

Most `R` packages depend on other `R` packages.

All `R` packages depend on `R`.

--

Both `R` and `R` packages have versions.

Different versions of `R` packages may depend on different versions of `R` and different versions of other packages.

---

## Dependencies in `R`

<img src="data:image/png;base64,#Dependency_Management_files/figure-html/dep-graph-1.png" width="70%" style="display: block; margin: auto;" />

---

## Dependencies in `R`

<img src="data:image/png;base64,#../img/cran_usethis.png" width="175%" style="display: block; margin: auto;" />

<small><small>Source: https://cran.r-project.org/web/packages/usethis/index.html</small></small>

---
class: center, middle

["It's ~~turtles~~ software all the way down"](https://en.wikipedia.org/wiki/Turtles_all_the_way_down) 🐢

--

... or is it???

---

## Digging ⛏ into your tool stack 🛠

Your full `R` setup consists of:

1. Specific versions of `R` packages<sup>1</sup>

2. A specific version of `R`<sup>2</sup>

3. A specific version of your operating system

4. Specific hardware

.small[
[1] You can have different libraries with different (versions of) `R` packages.

[2] You can also have different versions of `R` installed on your machine.
]

---

## Further dependencies

<img src="data:image/png;base64,#../img/toppling-tower.jpg" width="30%" style="display: block; margin: auto;" />

<small><small>*Note*: We could also add system libraries between `R` and the OS (which are especially relevant in the [Linux/Unix world](https://www.tutorialspoint.com/operating_system/os_linux.htm)).</small></small>

---

## The danger of dependencies

<img src="data:image/png;base64,#https://imgs.xkcd.com/comics/dependency.png" width="50%" style="display: block; margin: auto;" />

<small><small>https://xkcd.com/2347/</small></small>

---

## What to "ship" 🚢?

- `R` code (+ underlying data)
  - this should include information about the packages used

- information about the versions of `R` and of the packages used

--

- your whole computational environment (focus of the next session)

- overall goal: preventing what is known as "code rot" & "works-on-my-machine errors" (WOMME)

---

## Dependency management solutions

As with almost everything in the `R` ecosystem, there are multiple solutions for dependency management:

- a manual approach

- [~~`checkpoint`~~](https://github.com/RevolutionAnalytics/checkpoint)<sup>1</sup>

- [**`groundhog`**](https://groundhogr.com/)

- [`renv`](https://rstudio.github.io/renv/)<sup>2</sup>

- [`rang`](https://github.com/gesistsa/rang)<sup>3</sup>

.small[
[1] Not an option anymore as it relied on the *CRAN Time Machine snapshots* from the *Microsoft R Application Network* (MRAN), which was [retired in July 2023](https://techcommunity.microsoft.com/t5/azure-sql-blog/microsoft-r-application-network-retirement/ba-p/3707161).

[2] `renv` is the successor of [`packrat`](https://github.com/rstudio/packrat) (which is not maintained anymore).

[3] Developed by members of the team [Transparent Social Analytics](https://www.gesis.org/en/institute/staff/orga/tiles/5/74?cHash=8fbd330b798c8dd7cb84097ddfd82054) at GESIS.
]

---

## Manual approach to dependency management

There is an easy-to-use manual solution for providing information about the packages and `R` version used in your project:

.small[

```r
sessionInfo()
```

```
## R version 4.3.2 Patched (2023-10-31 r85451 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8 LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] groundhog_3.1.2 depgraph_0.1.0 emo_0.0.0.9000 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
## [9] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0 knitr_1.45
##
## loaded via a namespace (and not attached):
## [1] remotes_2.4.2.1 rlang_1.1.2 magrittr_2.0.3 compiler_4.3.2 png_0.1-8
## [6] systemfonts_1.0.5 callr_3.7.3 vctrs_0.6.4 rvest_1.0.3 profvis_0.3.8
## [11] pkgconfig_2.0.3 crayon_1.5.2 fastmap_1.1.1 ellipsis_0.3.2 fontawesome_0.5.2
## [16] labeling_0.4.3 utf8_1.2.4 promises_1.2.1 rmarkdown_2.25 sessioninfo_1.2.2
## [21] tzdb_0.4.0 ps_1.7.5 bit_4.0.5 xfun_0.40 cachem_1.0.8
## [26] jsonlite_1.8.7 highr_0.10 later_1.3.1 uuid_1.1-1 parallel_4.3.2
## [31] prettyunits_1.2.0 R6_2.5.1 bslib_0.5.1 stringi_1.7.12 reticulate_1.34.0
## [36] pkgload_1.3.3 jquerylib_0.1.4 Rcpp_1.0.11 assertthat_0.2.1 usethis_2.2.2
## [41] base64enc_0.1-3 Matrix_1.6-1.1 httpuv_1.6.12 igraph_1.5.1 timechange_0.2.0
## [46] tidyselect_1.2.0 rstudioapi_0.15.0 yaml_2.3.7 miniUI_0.1.1.1 processx_3.8.2
## [51] pkgbuild_1.4.2 lattice_0.22-5 shiny_1.7.5.1 withr_2.5.2 evaluate_0.23
## [56] xaringan_0.28 archive_1.1.6 urlchecker_1.0.1 xml2_1.3.5 pillar_1.9.0
## [61] generics_0.1.3 vroom_1.6.4 hms_1.1.3 munsell_0.5.0 scales_1.2.1
## [66] ggnetwork_0.5.12 xtable_1.8-4 glue_1.6.2 tools_4.3.2 xaringanExtra_0.7.0.9000
## [71] miniCRAN_0.2.16 webshot_0.5.5 fs_1.6.3 AMR_2.1.1 grid_4.3.2
## [76] woRkshoptools_0.1.0 devtools_2.4.5 colorspace_2.1-0 repr_1.1.6 cli_3.6.1
## [81] kableExtra_1.3.4 fansi_1.0.5 viridisLite_0.4.2 svglite_2.1.2 gtable_0.3.4
## [86] sass_0.4.7 digest_0.6.33 ggrepel_0.9.4 skimr_2.1.5 htmlwidgets_1.6.2
## [91] farver_2.1.1 memoise_2.0.1 htmltools_0.5.6.1 lifecycle_1.0.4 httr_1.4.7
## [96] easypackages_0.1.0 mime_0.12 bit64_4.0.5
```
]

---

## `groundhog`

[`groundhog`](https://groundhogr.com/) is a lightweight package that allows you to increase the reproducibility of your `R` scripts.

It does so by installing and loading "packages & their dependencies as available on chosen date on CRAN".

---

## Using `groundhog`

All you need to do to use `groundhog` is to specify the packages you want to use in your script and a date.

```r
install.packages("groundhog")
library(groundhog)

# packages to load, in the versions available on CRAN on the given date
pkgs <- c("tidyverse", "janitor", "sjPlot")
groundhog.library(pkgs, date = "2023-11-17")
```

---

## How `groundhog` works

From the [package website](https://groundhogr.com/back-end/):

"groundhog relies on a database that contains virtually all package versions ever uploaded to CRAN, the date when they were published, and all dependencies."

`groundhog` also used to rely on MRAN. Since that has been retired, however, it now uses its own package repository: [GRAN: Groundhog R Archive Neighbor](https://groundhogr.com/gran/).

---

## How `groundhog` works

`groundhog` uses its own package library to install the packages you specified. From the documentation of the `restore.library()` function:

"When groundhog installs a package, it installs it into groundhog's library."

```r
library(groundhog)
get.groundhog.folder()
```

```
## [1] "e:\\home/R_groundhog/groundhog_library/"
```

"Groundhog then immediately moves the installed package(s) (and their dependencies) to the default personal library."
```r
.libPaths()[1]
```

```
## [1] "D:/Programme/R/library"
```

.small[
*Note*: You can find a lot more technical information on the [`groundhog` website](https://groundhogr.com/) and within the help files for the `groundhog` functions.
]

---

## Reversing changes made by `groundhog`

You can reverse the changes made by `groundhog` to your default personal `R` package library:

```r
restore.library()
```

---

## Pros and cons of `groundhog`

*Pros:*

- can be easily used to make existing `R` scripts more reproducible
- does not require a "project-based workflow"<sup>1</sup>
- does not require any specific knowledge on the reproducer's side

*Cons:*

- limited to packages from CRAN
- works with package snapshots from specific dates, not with specific package versions
  - in reality, our installed packages are very often not up-to-date
- installs specific package versions, but not specific versions of `R`

.small[
[1] While you as someone who highly values reproducibility, of course, do use a project-based workflow, the people who want to reproduce your analysis might not 😉
]

---

## Choosing a date

The recommendation by the `groundhog` package authors for choosing a date is that "a good default is the first day of the month when starting your project".

Once you have added `groundhog.library()` to your script, re-run it to make sure it produces the expected results.

You can also update all of the packages you use and then specify the current date as the date for `groundhog.library()`.

---

## `groundhog` and `R` versions

The `groundhog` database also includes information on base `R` releases.

As `groundhog` does not install specific versions of `R`, you should still specify the version of `R` you used in your script.

```r
R.version.string
```

```
## [1] "R version 4.3.2 Patched (2023-10-31 r85451 ucrt)"
```

You can find the release dates for all versions of R via the [CRAN archive page](https://cran.r-project.org/src/base/).
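
One way to make the `R` version explicit in the script itself is to store it next to the `groundhog` date and compare it with the running session. The following is only a minimal sketch in base `R`: the variable name `r_version_used` is made up for illustration, and the version number is the one from the output above.

```r
# Assumption: the analysis was developed with this R version
# (illustrative variable name, version taken from R.version.string above)
r_version_used <- "4.3.2"

# getRversion() returns the version of the running R session
# as an object that can be compared with a version string
if (getRversion() != r_version_used) {
  warning(
    "This script was written with R ", r_version_used,
    ", but you are running R ", as.character(getRversion()), "."
  )
}
```

A warning (rather than an error) lets reproducers continue with a different `R` version while making the mismatch visible.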
---

## Excursus: Updating `R`

As you probably know, you can download the most recent version of `R` from the [CRAN website](https://cran.r-project.org/).

You can download older versions of `R` via the CRAN archive pages for [*Windows*](https://cran.r-project.org/bin/windows/base/old/) and [*Mac OS X*](https://cran-archive.r-project.org/bin/macosx/).<sup>1</sup>

.small[
[1] How you install and update `R` on Linux depends on your [distribution](https://cran.r-project.org/bin/linux/).
]

---

## Excursus: Updating `R`

On Windows, you can also use the [`installr` package](https://talgalili.github.io/installr/) to update `R`.<sup>1</sup>

```r
install.packages("installr")
library(installr)
updateR()
```

.small[
[1] The `installr` package also offers other useful functionality, such as installing `Git` via the `install.git()` function.
]

---

## Excursus: Updating packages

The easiest way to update individual packages is to reinstall them with `install.packages()`. To update all packages or a specific set of packages, you can use the `update.packages()` function.
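
Before updating, it can be worth recording which package versions are installed right now, so that the previous state stays documented. A minimal sketch using only base `R`; the object name `pkgs_before` and the file name are assumptions made for this example:

```r
# Collect name and version of every installed package (base R only)
ip <- installed.packages()
pkgs_before <- data.frame(
  package = unname(ip[, "Package"]),
  version = unname(ip[, "Version"]),
  stringsAsFactors = FALSE
)

# Hypothetical file name; adjust it to your project
write.csv(pkgs_before, "installed_packages_before_update.csv", row.names = FALSE)
```

The calls that perform the update itself follow.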
```r
# update all installed packages
update.packages()

# update specific packages
update.packages(oldPkgs = c("tidyverse", "janitor", "sjPlot"))
```

---

## Excursus: Updating packages

With the following code, you can detach and update all of the packages you have currently loaded in your `R` session (excluding core `R` packages):

```r
# names of all attached packages, e.g., "package:dplyr"
loaded_pkgs <- search()
loaded_pkgs <- loaded_pkgs[grep("^package:", loaded_pkgs)]

# core packages that should stay attached
exclude_pkgs <- c("package:base", "package:stats", "package:graphics",
                  "package:grDevices", "package:utils", "package:datasets",
                  "package:methods")
loaded_pkgs <- loaded_pkgs[!loaded_pkgs %in% exclude_pkgs]

# detach (and unload) the remaining packages
for (pkg in loaded_pkgs) {
  detach(pkg, character.only = TRUE, unload = TRUE)
}

# strip the "package:" prefix and install the most recent versions
update_pkgs <- gsub("^package:", "", loaded_pkgs)
install.packages(update_pkgs)
```

---

## Excursus: Updating packages

If you want to install a specific version of a package (not the most recent one), the easiest option is to use a function from the `remotes` package.

```r
library(remotes)

# install a specific (older) version from the CRAN archive
install_version("tidyverse", version = "1.3.0")
```

---
class: center, middle

# [Exercise](https://jobreu.github.io/reproducible-research-gesis-2023/exercises/Exercise_Dependency_Management.html) time 🏋️♀️💪🏃🚴

## [Solutions](https://jobreu.github.io/reproducible-research-gesis-2023/solutions/Exercise_Dependency_Management.html)

---
class: center, middle

# More comprehensive dependency management options in `R`

---

## `renv` and its advantages<br> compared to `groundhog`

Let's talk about another option: `renv`

- `renv` is integrated with RStudio
- `renv` is developed by professional software engineers (rather than academic researchers)
- `renv` works with more repositories than `groundhog` does

For more information on `groundhog` vs. `renv`, see: https://groundhogr.com/renv/

---

## What is `renv`?

The package `renv` helps to create reproducible environments for *R projects*.

---

## What is an R project?

> "R experts keep all the files associated with a project together — input data,
> R scripts, analytical results, figures. This is such a wise and common
> practice that RStudio has built-in support for this via projects."

.small[Source: https://r4ds.had.co.nz/workflow-projects.html]

---

## Why?

Use renv to make your R projects more isolated, portable and reproducible.

- Isolated: Installing a new or updated package for one project won’t break your other projects, and vice versa. That’s because renv gives each project its own private library.
- Portable: Easily transport your projects from one computer to another, even across different platforms. renv makes it easy to install the packages your project depends on.
- Reproducible: renv records the exact package versions you depend on, and ensures those exact versions are the ones that get installed wherever you go.

.small[Source: https://rstudio.github.io/renv/]

---

## Let's talk about libraries<br> (and packages)

- When talking about a "reproducible R environment", this mostly refers to all the R packages that are being used
- Most of us have *one* "system library", where all installed R packages can be found (mine can be found here: D:/Programme/R/library)
- "The directories in R where the packages are stored are called the libraries" [1]
- "Packages are collections of R functions, data, and compiled code in a well-defined format, created to add specific functionality" [1]

.small[[1] https://hbctraining.github.io/Intro-to-R-flipped/lessons/04_introR_packages.html]

---

## Is one R system library a good idea?

However, having *one* central system library is really bad, especially when things break -- and things will break!
("dependency hell")

<img src="data:image/png;base64,#https://imgs.xkcd.com/comics/dependency.png" width="38%" style="display: block; margin: auto;" />

<small><small>https://xkcd.com/2347/</small></small>

---

## renv offers "project libraries"

- Instead of using *one* central system library for all installed R packages, renv sets up a "project library", i.e., each R project has its own library
- Project libraries mean that each project has its own independent collection of packages

---

## Where all the R packages grow... repositories

- The usual way to install an R package is to download it from a repository
- The most famous one is CRAN, the Comprehensive R Archive Network (other sources include Bioconductor and GitHub)
- I am currently using:

```r
getOption("repos")
```

```
##                        CRAN
## "https://cran.rstudio.com/"
## attr(,"RStudio")
## [1] TRUE
```

---

## renv and repositories

- renv creates an R project-specific library and installs packages into that library
- Hence, it needs to know where (a) package(s) can be found; not all packages can be found on CRAN

---

## How does renv work?
- It is assumed that your work is organized as an "R project" (in RStudio)
- Similar to Git, you first have to initialize your R project, letting `renv` know that it should take care of everything related to a reproducible R environment
- At some point, when everything is working fine and you want to preserve this state of your R project, you can take a "snapshot" of the R packages used in your project
- This snapshot can be shared with others or your future self, so that you can always restore the state in which everything in your R project was working well (in terms of a "reproducible R environment")

---

## Workflow in `renv`

<img src="data:image/png;base64,#../img/bw/fig_renv.png" width="100%" style="display: block; margin: auto;" />

---

## `renv::init()`

- `renv::init()` is run once and adds a few new directories and files to your R project
  - `renv/library`: This directory is the project library that contains all packages currently used by your project
  - `renv.lock`: Records metadata about every package, ensuring that it can be re-installed on a new machine
  - `.Rprofile`: Runs automatically every time you start R (in that project); renv uses this file to configure your R session to use the project library

.small[Source: https://rstudio.github.io/renv/articles/renv.html#getting-started]

---

## `renv::snapshot()` and `renv::restore()`

- Running `renv::snapshot()` updates `renv.lock` and documents the R packages (and their specific versions) that are currently used in your R project
- To share an R project, you need to provide access to `renv.lock`, `.Rprofile`, `renv/settings.json` and `renv/activate.R` (from a Git perspective, these are the files you need to commit to your Git repository)
- Next, using `renv::restore()`, a collaborator can reproduce your R environment exactly (i.e., download and install the specific R package versions)

---

## Installing additional R packages
Use `install.packages()` or `renv::install()` --- ## Updating R packages in your R project Use `renv::update()`... --- class: center, middle # More options for dependency management in `R` --- ## And then there is `rang` > "The goal of rang (Reconstructing Ancient Number-crunching Gears) is to > obtain the dependency graph of R packages at a specific time point. > Although this package can also be used to ensure the **current R computational > environment** can be reconstructed by future researchers, this package gears > towards **reconstructing historical R computational environments** which have not > been completely declared. For the former purpose, **packages such as renv, > groundhog, miniCRAN, and Require should be used**. One can think of rang as an > archaeological tool" (emphasis added). .small[Source: https://github.com/gesistsa/rang] --- ## And then there is `rang` (cont.) For more information, see: - https://chainsawriot.github.io/gesis2023_rang/#/title-slide - https://github.com/gesistsa/rang ---